1. Introduction


1.1 Antares 
of the Id - 
vith high- 
tecture 


memory, and can access 
t.1). CPUs coordinate 
connects to a NuBus 


[...] video subsystem. Higher
video rates can be provided by using [...], each driving a section of the
screen.


Antares (Figure 1.2) is a parallel processor comprising 4 independent and
identical 32-bit Processing Units (PUs) which share an instruction cache and a data
cache. An on-chip Memory Management Unit (MMU) performs virtual-to-real 
address translation, initiates and controls transfers between the CPU and local or 
remote memories, and handles inter-CPU messages. The MMU provides a flat 
(unsegmented) virtual address space of 1024 million words (4 gigabytes), and 
accommodates a real memory size of 64 million words. The instruction and data 
caches are identical: each has a capacity of 4096 bytes, organized as 64 lines of 16 
words (64 bytes). Antares caches are architecturally visible: instructions are 
provided to prefetch, create, flush, and invalidate cache lines. 
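
The sizes quoted above can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are hypothetical, and it assumes that "million" means 2**20 and that a word is 4 bytes (32 bits), which is what makes 1024 million words equal 4 gigabytes.

```python
WORD_BYTES = 4  # assumption: one word = 32 bits = 4 bytes


def cache_capacity_bytes(lines=64, words_per_line=16, word_bytes=WORD_BYTES):
    """Capacity of one cache: lines x words per line x bytes per word."""
    return lines * words_per_line * word_bytes


def words_to_bytes(words, word_bytes=WORD_BYTES):
    """Convert a size in words to a size in bytes."""
    return words * word_bytes


# Assumption: "million" is read as 2**20, as the 4-gigabyte figure implies.
VIRTUAL_WORDS = 1024 * 2**20
REAL_WORDS = 64 * 2**20

if __name__ == "__main__":
    print(cache_capacity_bytes())                  # bytes per cache
    print(words_to_bytes(VIRTUAL_WORDS) // 2**30)  # virtual space in gigabytes
    print(words_to_bytes(REAL_WORDS) // 2**20)     # real memory in megabytes
```

Under these assumptions each cache holds 4096 bytes, and the 64 million words of real memory correspond to 256 megabytes.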


1 A minimum system with monochrome display can be constructed with a single Antares CPU.
Figure 1.2. Major Elements of the Antares CPU Chip
[Figure residue: Antares PU block diagram; recoverable labels: Store, Instruction Pipeline, General Registers, MULT., DIV.]


iagram of an Antares PU. While all four PUs 
Ice, they are otherwise independent. Each PU 
he, registers, and arithmetic 
uction stream. Each PU 
: R15), a ie set of 


hold data or addresses. Registers RO‘ 
format base plus displacement mode add 
register for branch and link instruct 
instruction set in which only load an 
which most instructions execute in one ¢ 


[...] broadcast and semaphore operations
are provided to coordinate activities executing on different PUs.


1.2 Parallel Processing 


The objective of the Antares design project is the development of a high- 
performance, single-chip CPU. Given a technology which will provide over a half 
million transistors on a chip, how can this "real estate" best be exploited to achieve 
this objective? The primary ingredients of a recipe for a fast, general-purpose, 
CPU are "big cache, small cycle time", so a large part of the available real estate is 
allocated for an on-chip cache (Figure 1.4b). To achieve a small cycle time, the 
processor (PU) implements a simple, general-purpose instruction set; also, for 
both cost and performance reasons, an on-chip Memory Management Unit (MMU) 


peenas Similarly, adding a graphics processor 
sment for only part of the system's workload. 
— three more, for a Ve of four, in the Antares 


pa times (4X) that of the 


e offers a potential perforr 
i 2X, no other design 


essor; even if the aver: 
alternative offers a comparable acro 
Antares software development is to rea 


Antares programs can execute in [...]
These modes are categorized using (with some liberties) the taxonomy developed
by Flynn [1972].


SISD (single instruction stream, single data stream). This mode is uni-, or
serial, processing: only one PU executes. Antares typically alternates between
intervals of serial and of parallel processing: a single PU initiates (and often 
participates in) a set of parallel computation activities, and later may accumulate the 
results of these activities. 


SIMD (single instruction stream, multiple data streams). This mode 
corresponds to the usual view of parallel processing: each PU executes the same 
operation on different data streams, as illustrated in Figure 1.5, or on different 
elements of the same data stream. Data access may be ordered or random. In 
ordered access, inter-PU coordination is implicit, as when each PU operates on 
every fourth element of a vector. In random access, explicit inter-PU coordination 
[Figure 1.5 residue; recoverable labels: PU 0 operates on A[1] - A[...], PU 1 operates on A[65] - A[...]; or PU 0 operates on A[1], A[5], ...]


[...] semaphore mechanism, [...] parallelism to exploit,
either with assembly code or by a co[...]
to "unwind" a loop which operates on an array [...] on 4 PUs, with each PU
operating on every 4th array element. Optimal performance is easily obtained, since
all PUs are doing the same work.
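
The "every 4th array element" partitioning can be sketched as follows. This is a sequential illustration, not Antares code: the names `pu_indices` and `simd_scale` are hypothetical, and the outer loop over PUs stands in for four PUs that would run concurrently on the hardware.

```python
NUM_PUS = 4  # number of Processing Units per Antares CPU, per the text


def pu_indices(pu, n, num_pus=NUM_PUS):
    """Indices handled by one PU: pu, pu + 4, pu + 8, ... below n."""
    return range(pu, n, num_pus)


def simd_scale(vector, factor, num_pus=NUM_PUS):
    """Apply the same operation to every element, split across PUs by stride."""
    result = [None] * len(vector)
    for pu in range(num_pus):  # on hardware these iterations run in parallel
        for i in pu_indices(pu, len(vector), num_pus):
            result[i] = vector[i] * factor
    return result


if __name__ == "__main__":
    print(list(pu_indices(1, 10)))
    print(simd_scale(list(range(10)), 2))
```

Because the index sets of the four PUs are disjoint and together cover the whole array, no explicit coordination is needed during the loop, which is the "ordered access" case described above.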


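As an example of SIMD mode [...], consider the common graphics
transformation operation (used in scaling, rotation, and translation) which involves
the 1 x 4 matrix multiplication

$$
[x^* \;\; y^* \;\; z^* \;\; w^*] = [x \;\; y \;\; z \;\; w] \times
\begin{bmatrix}
c_{11} & c_{12} & c_{13} & c_{14} \\
c_{21} & c_{22} & c_{23} & c_{24} \\
c_{31} & c_{32} & c_{33} & c_{34} \\
c_{41} & c_{42} & c_{43} & c_{44}
\end{bmatrix}
$$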


where
[x y z w] = original coordinate set,

[x* y* z* w*] = transformed coordinate set,
[...]ation. (For any particular transformation, s[...] to be 0 or 1.)


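$$
\begin{aligned}
x^* &= x c_{11} + y c_{21} + z c_{31} + w c_{41} \\
y^* &= x c_{12} + y c_{22} + z c_{32} + w c_{42} \\
z^* &= x c_{13} + y c_{23} + z c_{33} + w c_{43} \\
w^* &= x c_{14} + y c_{24} + z c_{34} + w c_{44}
\end{aligned}
$$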


f this transformation, PU 0 can be 
, and so on. Each PU preloads its 


each PU executes a cache prefetch instruction (Section 3) to prefetch the next line of 
coordinate data. (Only one prefetch actually takes effect.) By careful scheduling of 
prefetch and computation operations, very high transformation rates can be realized. 
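
A minimal sketch of how the transformation might be split across the four PUs is given below. The assignment of one output component per PU is an assumption (the corresponding sentence in the text is only partly legible), the function names are hypothetical, and the list comprehension over PUs stands in for concurrent execution.

```python
def transform_component(point, matrix, pu):
    """Work assumed for one PU: output component `pu` of [x y z w] x C.

    point  : [x, y, z, w]
    matrix : 4 x 4 list of rows, matrix[k][j] = c_(k+1)(j+1)
    """
    return sum(point[k] * matrix[k][pu] for k in range(4))


def transform(point, matrix):
    """All four components; on hardware each would be computed by a different PU."""
    return [transform_component(point, matrix, pu) for pu in range(4)]


# Example: translation by (2, 3, 4) in the row-vector convention used above.
TRANSLATE = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [2, 3, 4, 1],
]

if __name__ == "__main__":
    print(transform([1, 1, 1, 1], TRANSLATE))
```

With this split each PU needs only one column of the coefficient matrix, which fits the remark that each PU preloads its own values before the coordinate data is streamed through.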


MISD (multiple instruction streams, single data stream). In this mode, each
PU executes a different operation on the same data stream element; data is 
"pipelined" between PUs (Figure 1.6). For example, consider the computation of 


$$y = ax^3 + bx^2 + cx + d$$

which might be divided across PUs as follows.


• PU 0: read x, compute cx + d, store x, cx + d
• PU 1: [...] x^2, bx^2 [...]
• PU 2: [...], compute and store ax^3 [...]
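
The pipelined division of the polynomial can be sketched as three stage functions, one per PU. Only the PU 0 step is fully legible in the text; the exact split between PU 1 and PU 2 shown here is an assumption, and the function names are hypothetical. The stages are chained sequentially, whereas on Antares each stage would run on its own PU with values handed over between PUs.

```python
def pu0(x, c, d):
    """PU 0: read x, compute cx + d, pass on x and the partial sum."""
    return x, c * x + d


def pu1(x, partial, b):
    """PU 1 (assumed split): compute x^2, add b*x^2 to the partial sum."""
    x2 = x * x
    return x, x2, partial + b * x2


def pu2(x, x2, partial, a):
    """PU 2 (assumed split): compute a*x^3 and produce the final value."""
    return partial + a * x2 * x


def evaluate(x, a, b, c, d):
    """Chain the stages to compute y = a*x^3 + b*x^2 + c*x + d."""
    return pu2(*pu1(*pu0(x, c, d), b), a)


if __name__ == "__main__":
    print(evaluate(2, 1, 2, 3, 4))
```

In a real pipeline, while PU 2 finishes element n, PU 1 works on element n+1 and PU 0 on element n+2, so throughput is limited by the slowest stage; this is why balancing the work per stage matters.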


[...] semaphores: for example, one
semaphore might be used to [...] PU 1, another to transmit x^2
from PU 1 to PU 2, and so on.


and it is not trivial to bal 
Interpretive programs, 


timize performance. 
preter or a 68000 


can be eased somewhat by the use of. 
which can be executed in parallel. 


[...] parallel activity boundaries. At present, parallel execution of compiled
code on Antares is not expected to cross procedure boundaries; execution will be
serialized (constrained to SIMD mode) at procedure call and return points, so that
the compiler will not have to maintain multiple stacks. It is possible that certain 
exceptions may be made (e.g., independent, non-recursive, leaf procedures ident- 
ified by compiler directive), and critical graphics system and operating system 
operations may be hand-coded to obtain maximum performance. No explicit 
support is provided for multi-tasking within an address space (i.e., "light-weight" 
processes). However, a user state task can execute in parallel with the kernel; an 
external interrupt will be assigned to an idle PU, if available, so that processing of 
the interrupt can be done while a user task continues in execution. 


arallel execution of compiled 
[Figure residue; recoverable labels: Antares CPU; instructions for: statement a; each PU operates on different data]


[...] Computing Capability" in [...]
[...] Slotnick of Illiac IV fame) [...] that debate, Gene Amdahl [...]
with two modes of operation, one high speed and one low speed, [...]
dominated by the low speed mode. These modes can correspond to vector
and scalar operation sequences on a vector computer, or parallel and serial
sequences on a parallel computer. This postulate has come to be called
"Amdahl's Law". A very readable and entertaining discussion of Amdahl's Law
is presented by Worlton [1981].


To illustrate, suppose that a workload executes in time T on a single Antares 
processor (Figure 1.8a). Assume that one-half of this workload can be parallelized
to run on 4 PUs, so that execution of this part of the workload is speeded up by a 
factor of four; the execution time of the other half of the workload is unchanged 
(Figure 1.8b). The workload execution time reduces to 5T/8, which represents an 


2 This came to be called "the great debate", and was one of the early skirmishes between the unis
and the multis.
• Parallel processing can be effective, [...] if only part of the
workload can be parallelized, [...] critical. In the Id
system, for example, the perfo[...] via parallelization of
the graphics pipeline is crucial [...] graphics requirements


[...] rely on explicit parallelization to
optimize performance of key components; some components will be coded in
[...] code sections to the compiler. These methods will supplement the parallelization
done implicitly by the compiler. Improved parallelization should be achieved as 
experience is gained in exploiting parallelism in software design and as compiler 
technology evolves. Thus, continuing gains in performance are expected over time. 
Note the improvement obtainable if the serial part of the workload of Figure 1.8 
could be speeded up by just a factor of two ...... 
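
The arithmetic of the example above (half the workload parallelized on 4 PUs, giving 5T/8) follows directly from Amdahl's Law, and the same calculation answers the closing remark about speeding up the serial part. The sketch below uses exact fractions; the function names are hypothetical.

```python
from fractions import Fraction


def exec_time(parallel_fraction, parallel_speedup, serial_speedup=1):
    """Execution time as a fraction of the original time T.

    The serial part (1 - p) runs `serial_speedup` times faster and the
    parallelizable part p runs `parallel_speedup` times faster.
    """
    p = Fraction(parallel_fraction)
    serial = (1 - p) / serial_speedup
    parallel = p / parallel_speedup
    return serial + parallel


def speedup(parallel_fraction, parallel_speedup, serial_speedup=1):
    """Overall speedup relative to the original single-PU execution."""
    return 1 / exec_time(parallel_fraction, parallel_speedup, serial_speedup)


if __name__ == "__main__":
    half = Fraction(1, 2)
    print(exec_time(half, 4), speedup(half, 4))        # the case of Figure 1.8
    print(exec_time(half, 4, 2), speedup(half, 4, 2))  # serial part 2x faster
```

For the workload of Figure 1.8 this gives 5T/8, a speedup of 1.6; if the serial half also ran twice as fast, the time would drop to 3T/8, a speedup of about 2.67.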
2. The Instruction [...]


2.1 Instruction Set Design 


execution control, help reduce cycle 
ith caches and MMU, on a single se 


instructions are 32 bits in length. ° 
Size operations (such as 


[...] performance by increasing the instruction [...] rate (or, equivalently,
instruction bandwidth). Hennessy [19...] [...] benchmark comparison of the
Stanford MIPS processor and the Motorola 68020: the static code size for the
MIPS processor was 40% greater than [...] the 68020, and the instruction
bandwidth was 20% greater. (However, [...] MIPS processor used only 1/4 as
many cycles as the 68020.) Instruction de[...]


executing the same code. Analysis of static and dynamic instruction frequencies 
shows that a 32-bit instruction length is longer than necessary for many of the most 
frequent instructions. A significant improvement in instruction density can be 
achieved by using a 16-bit instruction length or by variable-length instructions (for 
example, the Fairchild Clipper has 16-, 32-, 48-, and 64-bit instructions). A 16-bit 


1 Reduced Instruction Set Computer: see, for example, Hennessy [1984], [1985], or Patterson
[1985].
2 Complex Instruction Set Computer
standard instruction format was chosen for Antares because of the hardware cost
and complexity of handling variable-length instructions. Antares does provide 
extended format load and store instructions in base plus displacement addressing 
mode to reduce synthesis costs in certain addressing situations. 


struction format, and for the choice of 
_ synthesis of higher-level opera - 
ithmetic instructions (unfeasible 


A second and related motive for 
a number of Antares instructions, ii 


ediates, displace - 
set design and 
», because of its 


ments, and direct address 
encoding. The Antares i i 5 
encoding and because pie ei and immediate fields vary in 
their use. nove field sizes have been selected to best match function and 




re in the range 0-16, ‘about 95% are in the range 
arge proportion of the immediates in the range 


re s static local variables, or 
addressing. This improves 
instruction stream density and facilitate[...]. Similar arguments apply to
the (conditional) branch displacement ([...] 256 instructions), to the data
address displacement ([...] of 64 words) for [...]d-format base plus displacement
addressing, and to other instruction set parameters. To the greatest extent possible,
instructions are designed so that high-frequency operations can be executed with a
single instruction: lower-frequency operations are synthesized by instruction 
sequences which, because of the short instruction length, tend to require relatively 


small amounts of instruction space. 


3 HP uses the term "precision" to describe the tightly-encoded functional architecture of the HP
Spectrum line (Birnbaum and Worley [1985]).


4 See, for example, Hennessy et al [1982]
5 A choice based on, among other factors, stack size frequency distributions
[Figure residue; recoverable labels: Mask Register; Status Save Register*; PC Save Queue* (FIFO pair); Status/Control Register**; Program Counter (PC); Link Register; Special, Control, and Status Registers. * privileged. ** contains both privileged and non-privileged fields.]


Figure 2.1. PU (Local) Registers


ith a flat (unsegmer 


discussed in Section 5.) Each PU 
broadcast operations, can be accessed Gn 
registers which are accessed by all PUS.: 


The local registers of a PU are shown in Figure 2.1. There are 16 general-
purpose registers, R0-R15. All transf[...] and to memory are performed by
general register load and store instructions. Registers R0-R3 can be used as base
registers in standard-format base plus displacement addressing; all 16 registers can
be used as base registers in extended-format base plus displacement addressing.
Register R4 is used as the link, or return address, register by jump and link
instructions.

In addition to general registers, each PU has a set of special, control, and status 
registers. There are seven local special registers (S0-S3 and S7-S9) and eight
global special registers. Special register contents can be read and, in certain cases, 
written, by Move Special instructions, which transfer data between general and 
special registers. Values also may be written to special registers as the result of 
executing other instructions. Access to some special registers is privileged, and can 
be effected only in system state. A brief summary of the functions of the local
special registers⁶ follows; for details, see the Antares Instruction Set Manual.


• Mask Register (S0). This register is an implicit operand register of
bit field manipulation instructions [...] set with the field position and length
by the mask (MSK) instruction.

these registers. On 
hel executes a pair of 


PC Save Queue. 


Two PCs (current and next) and two [...]s are required because of the
delayed branch instruction execution (discussed later in this section).


The PU Status/Control Register contains status and control information.
Figure 2.2 shows this register and [...] certain fields of interest in this
overview. Mode bits control various aspects of PU operation, and are set and
cleared by Set Mode and Clear Mode instructions. Some mode bits may be
modified by a PU operation; for example, execution of a trap instruction clears the user
mode and trap enable bits. Antares can perform arithmetic on 8-, 16-, and 32-bit
operands, so the condition code field contains 4 carry bits as well as Zero,
Negative, and Overflow bits. Flag bits are set during certain PU operations. The
register count field is used to store the register count of a load/store multiple


6 Special register lengths can vary according to function; the numbering of Special Registers may
be revised to help decoding.
[Figure residue; recoverable field labels: reg. count; flags; cond. codes; mode bits (trap enable, user/system, [...] available)]


Figure 2.2. PU Status/Control Register


thode bit and the 


instruction interrupted by a page fault. The "registers avail[...]


and Selection Registers 
s such as cache misses, ‘i nd PU utilization. The 


ach PU (user/system, 


2.3 Addressing and Addressing [Modes]


Instruction and data addresses in Antares are 32 bits in length. Instruction
addressing is in instruction-length units: a relative instruction address
(displacement) represents a half-word increment, and an absolute instruction address (PC
contents or absolute jump address) is a half-word memory address. Instructions
are assumed to be aligned on half-word boundaries. Data addresses are word
addresses for load and store word instructions, byte addresses for load and store
byte instructions. There are three data addressing modes: register, base plus
displacement, and direct.
[Figure residue; recoverable labels: S4 Test Register; S6[0], S6[1] Semaphore Flag & Prefix Address Regs.]


[...] is formed by concatenating the 8-bit displacement [...] with the prefix address from
special register S6[i], where i represents the value of the cluster number in the PU
Status/Control Register of the PU executing the instruction (Figure 2.4). The
displacement field provides bits 0-7 of the address, S6[i] provides bits 8-29, and
bits 30-31 are ignored.


The prefix address defines the start of a 256-word memory region which can be
accessed by load and store direct instructions; this region is called direct address 
space. Separate direct address spaces are provided for user and system state, and 
both user and system can redefine their direct address spaces as desired. The first 8 
locations of direct address space are semaphore locations; semaphore operations 
are performed by load and store direct accesses to these locations (see Section 4.3). 
Semaphore flags are kept with the prefix address in an S6 register and, if desired, 
can be changed when the prefix address is changed. 
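
Direct address generation as described above can be sketched with a pair of bit masks. This is an illustration under stated assumptions: bit 0 is taken as the least significant bit, the prefix is assumed to sit in bits 8-29 of the S6 register value (so the low 8 bits of the register, where the semaphore flags may live, are masked off), and the function names are hypothetical.

```python
PREFIX_MASK = 0x3FFFFF00  # bits 8-29 supplied by S6[i]; bits 30-31 ignored
DISP_MASK = 0x000000FF    # bits 0-7 supplied by the instruction's displacement


def direct_address(s6, cluster, displacement):
    """Form a word address from the prefix register pair and a displacement.

    s6           : pair of register values [S6[0], S6[1]] (assumed layout)
    cluster      : cluster number (0 or 1) from the PU Status/Control Register
    displacement : 8-bit displacement field of the instruction
    """
    prefix = s6[cluster] & PREFIX_MASK
    return prefix | (displacement & DISP_MASK)


def is_semaphore_location(displacement):
    """The first 8 locations of direct address space are semaphore locations."""
    return (displacement & DISP_MASK) < 8


if __name__ == "__main__":
    s6 = [0x00001200, 0xC0ABCD05]
    print(hex(direct_address(s6, 0, 0x34)))
    print(hex(direct_address(s6, 1, 0x10)))
```

Since the displacement is 8 bits wide, each cluster number selects one 256-word region, matching the size of direct address space given in the text.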
[Figure residue; recoverable labels: Semaphore Flag & Prefix Address Register pair; Cluster No. (0 or 1); PU Status/Control [...]; displacement field from instruction; word address in me[...]]


Figure 2.4. Direct Address Generation 


_in the initial implementation of Antares, this 
it. The cluster number i is assigned a value (0 or 


kernel can change the cluster number [...] access the user's direct
address space.


instruction space). 


The use of three different address types: word, half-word (instruction), and


[...] of Antares more difficult than would


[...] have performance advantages relative to byte addressing, and it is expected that
most Antares programming will be done in a higher-level language; very few 
programmers will need to be aware of the different address types. One reason 
multiple address types are used is to make the most efficient use possible of the 
relatively small (in conventional view) immediate fields of Antares instructions. 
For example, if only byte addressing was provided, the 8-bit immediate field of the 
Add Immediate instruction would give an immediate range of only 1-64 for word 
increments, and the 4-bit immediate field of the Subtract Immediate instruction 
would give an immediate range of only 1-4 for word decrements. A second reason 
is the elimination of index shifting for operations on word arrays which is required
if byte addressing only is provided (e.g., @A[i] = @A[0] + i*4).
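
Both reasons can be checked numerically. The sketch below is illustrative: it assumes 4-byte words and that an n-bit immediate field encodes 2**n distinct values, and the function names are hypothetical.

```python
WORD_BYTES = 4  # assumption: 32-bit words


def word_range_under_byte_addressing(field_bits, word_bytes=WORD_BYTES):
    """Largest word increment reachable if an immediate counted bytes."""
    return (2 ** field_bits) // word_bytes


def element_address_word(base, i):
    """Word addressing: the index is used directly."""
    return base + i


def element_address_byte(base, i):
    """Byte addressing: the index must first be shifted left by 2 (times 4)."""
    return base + (i << 2)


if __name__ == "__main__":
    print(word_range_under_byte_addressing(8))  # Add Immediate, 8-bit field
    print(word_range_under_byte_addressing(4))  # Subtract Immediate, 4-bit field
    print(element_address_word(100, 5), element_address_byte(400, 5))
```

The results (64 and 4 words) match the ranges quoted in the text, and the extra shift in `element_address_byte` is the index-shifting step that word addressing eliminates.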


bits 31 


bytes 0 


Bit 0 is the least signific 
data or the sign bit for si 
arithmetic mode specified. ) 


2.4 Instructions 


source register(s) 
base/address register 
destination register 

Mask register 


Reference Manual; only certain featur ' 


Operation of most of the general [...]
discussed in the preceding section. The [...] boolean condition) sets a register
to 1 if cc matches the current code and to 0 otherwise: it facilitates optimization of
logical expressions and helps mitigate [...]
condition codes usually impose. Antares [...]


operations which operate concurrently on both half-words of a word (in addition to 
full-word arithmetic operations). A mode bit in the PU Status/Control Register 
determines if partial-word arithmetic operates on bytes or half-words. The 
condition codes in the PU Status/Control Register include four carry flags: 2 or 4 
of these may be set as the result of a partial-word arithmetic or compare instruction. 
The LDCP instruction is used to load carry flags, extended to the current operand 
width, into a general purpose register. Load and store multiple instructions are 
provided to help keep procedure call and return overhead low. (While the cost of 
synthesizing these operations is not high in cycles, it is in terms of cache space.) 
REGISTER LOAD, STORE, AND MOVE INSTRUCTIONS


Lcc     ->Dst             load boolean on condition
LD      Imm->Dst
LDB     @Base->Dst
LDCP    ->Dst
LDM     @Base->Dst
LDW     Dsp->Dst
LDW     @Base->Dst
        @Base+Dsp->Dst


store multiple registers Sr 
store word (direct) 

store word (register) 

store word (base + displacement) 
store word (base + extended displac 


add register (bytes or halfwords with carries): 
add register (word) , 
add immediate 

: add register (word sare repel 


Sr1+Sr2->Sr1
Sr1+Sr2->Sr1
Sr->Dst
Sr1/Sr2->Sr1
Sr1*Sr2->Sr1
Sr1*Sr2->Sr1
Sr->Dst
Sr1-Sr2->Sr1
Sr1-Sr2->Sr1
Sr-Imm->Sr
Sr1-Sr2->Sr1
Sr1-Sr2->Sr1


TRANSFER AND 


*+1+Sr->Sr 
Bcc *+Dsp 


CMP     Sr1-Sr2     compare register (word)
CMP     Sr1-Imm     compare immediate
CMPP    Sr1-Sr2     compare register (bytes or halfwords)
JMP     *+Dsp       jump relative
JMP     @Sr         jump absolute
JMPL    @Sr         jump and link (return address -> reg. 4)
TSTF    Sr          test field under Mask
[...]   Imm         test mode bit number Imm


Figure 2.5. Antares Instruction Set

SHIFT, LOGICAL AND FIELD MANIPULATION INSTRUCTIONS


Sr1&Sr2->Sr1 

Sr1&—Sr2->Sr1 

Sr->Sr 

Sr->Dst 

Sr2,Sr1 

Sr->Dst 

Sr->Dst 

Sr->Dst 

Imm1, Imm2 

Sr, Imm 

~Sr->Dst         not (one's complement)
Sr1|Sr2->Sr1     or
Sr->Sr           set field under (Mask)
Sr<<Amt->Sr      shift left logical
Sr>>Amt->Sr      shift right logical
Sr1^Sr2->Sr1     exclusive or


CACHE CONTROL INSTRUCTIONS 


create data cache line 

flush data cache line 

invalidate data cache line 

invalidate instruction cache line 
invalidate all instruction cache lines 
prefetch data cache a 


BROADCAST AND CO 


PUmask 


@Base->Dst 
PUmask 
PUmask 


Sr->Dst of PUmask 
Imm 
@Base->* of PUmask    start halted PUs specified by PUmask at (Base)
Imm                   trap
PUmask                halt, or wait for PUs specified by PUmask to halt


Figure 2.5 (continued). Antares Instruction Set 
The address register of load multiple is incremented by the number of registers
loaded, and the address register of store multiple is decremented by the number of
registers stored, to help in stack manipulation.


ADCP, ADDP, MULP, SBCP, and [...] instructions operate on either byte or
half-word operands, depending on the setting of the partial-word arithmetic
mode bit; other arithmetic instructions operate on full-word operands. Multiply
and divide instructions execut[...] next sequential instruction is
issued in the cycle following [...] divide, and instruction issue [...]


16-bit i instruction format. The Mask register (local Special oe 
operand of Antares bit field manipulation instructions: onc 
fa field (by a MSK ptend extract, * , and test 


enalty (delay incurred as the result of a miss) are 
tformance. Antares: provides a set Ot instruc - 


PUmask field of the broadcast instructi 
Section 4.2. Set, clear, and test mode 
PU Status/Control Register. ITLB, R
the cache and Translation Buffer in tasks: 
instructions, are discussed in Section 6. 


2.5 Nominal Instruction Execution 


The Antares instruction execution pipeline has four stages: fetch (F), decode 
(D), execute (E), and store (S). Four different instructions can be in different 
stages of execution in any cycle. A fifth stage, called store 2 (S2), is used in 
executing load and store instructions. An Antares PU issues one instruction per 
cycle unless the pipeline is blocked by a cache miss, a cache bank conflict or a
pipeline interlock⁷ delay, a register wait condition (e.g., wait for multiply result), or


7 All Antares pipeline interlocks are hardware controlled (as opposed to, for example, the MIPS
processor, which relies on the compiler to generate interlock-free code).
execution of a synchronous multi-cycle instruction. In the absence of any of these
conditions, most instructions execute at a rate of one instruction per cycle. While
each instruction takes a minimum of four cycles, three cycles are overlapped by the
execution of other instructions. The nominal execution time of an instruction is
defined as the number of non-overlapped cycles required for its execution in the
absence of delays.
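
The overlap described above can be made concrete with a small timing model of the four-stage pipeline. This is a sketch of an ideal pipeline with no misses, conflicts, or interlocks; it ignores the S2 stage used by loads and stores, and the function names are hypothetical.

```python
STAGES = ["F", "D", "E", "S"]  # fetch, decode, execute, store


def pipeline_schedule(n):
    """Cycle (counting from 1) in which each of n instructions is in each stage.

    Instruction k (0-based) is issued one cycle after instruction k - 1, so it
    occupies stage number s in cycle k + s + 1.
    """
    return [{stage: k + s + 1 for s, stage in enumerate(STAGES)} for k in range(n)]


def total_cycles(n):
    """Cycles needed to complete n instructions in the ideal pipeline."""
    return n + len(STAGES) - 1 if n > 0 else 0


if __name__ == "__main__":
    for k, row in enumerate(pipeline_schedule(3)):
        print(k, row)
    print(total_cycles(10))
```

Each instruction spends four cycles in the pipeline, but a run of n instructions completes in n + 3 cycles, so the non-overlapped (nominal) cost per instruction approaches one cycle.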


Most Antares instructions [...]
exceptions include load and store multiple instructions, load and branch
instructions, and the multiply and [...] and store multiple are the
only synchronous multi-cycle [...]ctions take one cycle for


[...]ited in a single cycle,
[...]er instructions.


effective execution time is one cycle. “(E the 
e the destination register, it is delayed (via a 


of this overview. 


8 Several studies have shown that the branch shadow can be filled with a useful instruction at least
70 percent of the time: see, for example, Gross and Hennessy [1982].
3.1 Introduction


of the 4 PUs: careful use of the cache, via 
fetching and by cache line management, can help 


transfers, discusses the I’ 
s.the instructions provided ft 


[...] independent instruction and data caches. The caches are identical: each has a
total capacity of 4096 bytes, organized as 64 lines of 16 words (64 bytes). All
transfers between the CPU and memory [...] units of one line. The cache design
is 4-way set associative: the 64 lines of each cache are grouped into 16 sets of 4 lines.
Every memory line maps into one of these 16 sets, as illustrated in Figure 3.1. For
cache access purposes, a virtual word address divides into a word index, which
specifies one of the 16 words in a line, a set index, which specifies one of the 16
sets of 4 lines, and a tag. When a line is stored in the cache, its address tag is
stored in the tag store location which corresponds to that line. A set of flags also is
stored with the tag, including LRU bits, a system/user bit, a valid bit, and a
modified bit.
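
The division of a virtual word address into tag, set index, and word index can be sketched as below. The text gives the tag as bits 8-29 and the word index as bits 0-3; placing the set index in bits 4-7 is an inference from the 16-set organization, and the function names are hypothetical.

```python
def split_address(word_address):
    """Split a virtual word address into (tag, set_index, word_index).

    Bit 0 is taken as the least significant bit.
    """
    word_index = word_address & 0xF        # bits 0-3: one of 16 words in a line
    set_index = (word_address >> 4) & 0xF  # bits 4-7 (inferred): one of 16 sets
    tag = (word_address >> 8) & 0x3FFFFF   # bits 8-29: 22-bit tag
    return tag, set_index, word_index


def join_address(tag, set_index, word_index):
    """Inverse of split_address, for checking the field layout."""
    return (tag << 8) | (set_index << 4) | word_index


if __name__ == "__main__":
    tag, set_index, word_index = split_address(0x12345)
    print(hex(tag), set_index, word_index)
```

Two lines whose addresses differ only in the tag compete for the same set of 4 lines, which is where the LRU bits stored with each tag come into play.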


1 Both are determined by technology; a key factor in determining cycle time is the cache access 
time. CPU cycle time frequently is determined by the taken branch path length, which includes a 
cache access for the branch target. 
[Figure residue; recoverable labels: set index; word index within line]


[...]) The tag (bits 8-29) is
compared with the tags of the valid lines [...]. If a match is found, then a
cache hit has occurred. Bits 0-3 are used to select the word to be fetched or stored,
the LRU bits are updated, and the modified bit is set if the access is a store. If none
of the valid tags in the set match the tag of the current access, a cache miss occurs.
The LRU bits for the lines in the set are examined to determine which line is to be
replaced. If the selected line is modified, then it has to be written to memory. This 
operation is called a moveout. To reduce the time the requesting PU must wait, the 
line being moved out is placed in a moveout buffer, the missing line is read into the 
cache from memory (moved in) and the requestor activated, and the line in the 
moveout buffer then written to memory. If the line being replaced is not modified, 
the missing line simply is read into its cache location. In either case, the memory 
read is initiated by sending a line missing request to the MMU. 
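
The hit, miss, and moveout behaviour of one 4-line set can be modelled in a few lines. This is a behavioural sketch only, not the hardware design: the class name `CacheSet` is hypothetical, LRU order is kept as a list rather than as LRU bits, and a store that misses is assumed to bring the line in and mark it modified.

```python
class CacheSet:
    """One set of a 4-way set-associative, write-back cache with LRU replacement."""

    def __init__(self, ways=4):
        self.ways = ways
        self.lines = []  # entries are [tag, modified]; least recently used first

    def access(self, tag, is_store=False):
        """Return ("hit" or "miss", tag of the line moved out or None)."""
        for line in self.lines:
            if line[0] == tag:
                self.lines.remove(line)          # update LRU order
                line[1] = line[1] or is_store    # set modified bit on a store
                self.lines.append(line)
                return "hit", None
        moved_out = None
        if len(self.lines) == self.ways:
            victim = self.lines.pop(0)           # least recently used line
            if victim[1]:
                moved_out = victim[0]            # modified: needs a moveout
        self.lines.append([tag, is_store])       # movein of the missing line
        return "miss", moved_out


if __name__ == "__main__":
    s = CacheSet()
    for tag, store in [(1, True), (2, False), (3, False), (4, False), (1, False), (5, False)]:
        print(tag, s.access(tag, store))
```

Only a modified victim produces a moveout; an unmodified victim is simply overwritten, as the paragraph above describes.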


Physically, each cache is organized as four banks of 256 words, as shown in 
Figure 3.2. A 5 x 4 crossbar switch connects the PUs and MMU to the cache 
memory banks, and a four-ported tag store permits simultaneous cache access from 
[Figure residue; recoverable label: 16-word moveout buffer]


[...] simultaneous transfers between the cache and PUs or [...]
can be executed in a cycle, provided that each transfer uses a different
hen multiple requests ai 
and the other requestors 


balance between the cost of the Saat 
performance penalty of bank conflict d 


Delays caused by cache bank conflicts depend on the PU memory access rate,
the number of active PUs, the cache miss rate, and memory addressing patterns.
Consecutive words of a line reside in different cache banks: words 0, 4, 8, and 12
reside in bank 0, words 1, 5, 9, and 13 reside in bank 1, and so on. In array
operations, assigning each PU to operate on every fourth element of the array
results in each PU accessing a different cache bank, substantially reducing bank
conflict delays.
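
The effect of the interleaving can be seen by listing which bank each PU touches at a given step. The sketch assumes the array starts at word 0 of a line and that the four PUs advance in lockstep; the function names are hypothetical, and the blocked partition is included only for contrast.

```python
NUM_BANKS = 4


def bank_of(word_address):
    """Consecutive words are interleaved across the four banks."""
    return word_address % NUM_BANKS


def strided_banks(step, num_pus=4):
    """Banks touched at one step when PU p handles elements p, p + 4, p + 8, ..."""
    return [bank_of(pu + num_pus * step) for pu in range(num_pus)]


def blocked_banks(step, block=64, num_pus=4):
    """Banks touched when PU p instead handles a contiguous block of elements."""
    return [bank_of(pu * block + step) for pu in range(num_pus)]


if __name__ == "__main__":
    print(strided_banks(0), strided_banks(7))
    print(blocked_banks(0), blocked_banks(7))
```

With the strided assignment every PU hits a different bank at every step, whereas a blocked assignment with a block size that is a multiple of 4 sends all four PUs to the same bank whenever they advance together.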


3.3 Cache Design Decisions 


The overall size of the Antares cache was determined by the available chip "real 
estate", and its partitioning into separate instruction and data caches was determined 
by bandwidth requirements. These partitions were chosen to be equal in size
because this is believed to be the best division for small caches², and because it was
desired to make both caches identical to reduce chip design effort. The physical
division of each cache into four banks is based on a trade between the cost of the
required crossbar switch and the performance penalty of bank conflicts.


The cache design is store-to[...] store-through, for performance
reasons. Only lines in the data [...]ified. The same line may be [...]

é aces in the cache into which lines can n be mapped. 
her hand, a startup s:associated with each miss, so that the maximum 
emory bandwidth 1 is le or smaller line Sizes. In the absence of con - 


, rieares, § S is expected to 
‘a maximum bandwidth 


A set size of 4 was chosen over a [...] 2 to reduce the miss rate and
to reduce the probability that the four [...] might generate a reference pattern which
would cause set thrashing.


A cache can be designed to be accessed with virtual addresses or with real
addresses. In the latter case, address translation must be done on every memory
reference either before or, in some cases, in parallel with, the cache access. This
adds some complexity to a single-processor design and can reduce performance by
increasing cycle time (although the translation can be pipelined at the cost of
additional complexity). In Antares, real addressing of the cache would require a
translation mechanism capable of supporting four simultaneous translations, one
for each PU. Consequently, the Antares cache is virtually addressed: address
translation takes place only on miss processing, which substantially reduces the
performance demands on the translation mechanism.


2 Davidson [1987] concludes that the optimal instruction cache size is about 50 percent of total
cache capacity for most capacities. 


3.4 Cache Miss Timing 


When a PU accesses a word belonging to a line not in the cache, a cache miss
occurs, and the missing line is moved in from memory; the MMU translates the
virtual address of the line to a real address, reads it from memory, and stores it in
the cache. If this line replaces a modified line, the latter must be moved out. To
minimize the delay incurred by a PU on a miss, the modified line is written to a
moveout buffer (Figure 3.2), the missing line moved in, and the modified line then
written to memory. Rotate and forward operations further reduce the PU's delay.
The MMU initiates a line read from memory beginning with the word accessed on
the miss, reads the remaining words in the line, and then wraps around to read the
first part of the line, effectively rotating the line so that the referenced word is the
first word read. For example, if an access to word 5 of a line results in a miss,
word 5 is the first word read, as shown in Figure 3.3. When this word is read
from memory, it is forwarded to the requesting PU at the same time it is stored in
the cache, so the PU can continue execution without waiting for the movein to
complete. Subsequent accesses to that line, however, must wait for movein
completion.
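
The rotate operation amounts to reading the 16 words of the line in wrap-around order starting at the word that missed. A minimal sketch, with a hypothetical function name:

```python
WORDS_PER_LINE = 16


def movein_order(missed_word, words_per_line=WORDS_PER_LINE):
    """Order in which the words of a line are read on a miss.

    Reading starts at the word that caused the miss and wraps around to the
    start of the line, so the requested word arrives first.
    """
    return [(missed_word + k) % words_per_line for k in range(words_per_line)]


if __name__ == "__main__":
    print(movein_order(5))
```

For a miss on word 5 the order is 5, 6, ..., 15, 0, 1, ..., 4, which matches the example in the text: the requested word is forwarded to the PU immediately and the rest of the line follows.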


The delay incurred by a PU as the result of a cache miss depends on a variety of
factors, including cache bank conflicts, [...] busy delays, and translation (TB
miss) delays. The timing involved [...] processing, assuming a demand
miss and no delays, is shown in [...] discussed below. It is assumed


ith word of a line, and require : ¢ e line. (Instruction cache 
miss timing is similar excep ‘ ¢ 


A PU issues a load or [...] stage of the PU
pipeline. In the following [...]
pipeline, the cache does a tag [...] present. If it is, the
accessed word is loaded into or stored from a register in the following cycle (which
corresponds to the S2 stage of the pipeline); otherwise, a movein [...] initiated. The
requesting PU is [...]


re busy 
have to be moved. At present, cache accesses 
ed to have aces priority than any other cache 


blocking E PU access during moveout 
the design process. 


In cycle 2, the MMU translates the address of the line being moved in via a
Translation Buffer (TB) look-up. At the end of this cycle, it initiates memory
access for the first word to be moved in: word i, which is not necessarily the
first word of the line. The access time for the first word is assumed to be
equivalent to 5 cycles; it depends on RAM chip access time and the Antares cycle
time. Subsequent words can be read without further delay.


In cycle 8, word i is returned and, simultaneously, stored in the cache and 
forwarded to the requesting PU (which resumes execution, having incurred a delay
of 8 cycles). The remaining words of the line are stored during cycles 9-23. The 
line is marked valid at the end of cycle 23 and is available for PU access in cycle 
24. Store requests for words of the line being moved in are made by the MMU; if 
access to a cache bank is requested in the same cycle by both the MMU and a PU, 
the PU is given priority. 
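
The cycle numbers quoted above can be reproduced with a small calculation. This is
a sketch under stated assumptions: the function and parameter names are
hypothetical, and the extra "+ 1" assumes the 5-cycle RAM access occupies cycles 3
through 7, so that the first word arrives in cycle 8 as the text states.

```python
def movein_timeline(translate_cycle=2, first_word_access_cycles=5,
                    words_per_line=16):
    """Cycle numbers for a demand miss with a TB hit and no other delays."""
    # Assumption: the RAM access occupies the cycles after translation,
    # and the first word is returned in the cycle that follows.
    first_word_cycle = translate_cycle + first_word_access_cycles + 1
    # One further word is stored per cycle until the line is complete.
    last_word_cycle = first_word_cycle + (words_per_line - 1)
    return {
        "first_word_cycle": first_word_cycle,   # word i forwarded to the PU
        "pu_delay_cycles": first_word_cycle,    # delay seen by the PU
        "last_word_cycle": last_word_cycle,     # line marked valid at its end
        "line_available_cycle": last_word_cycle + 1,
    }


print(movein_timeline())
```

With the default values the sketch gives a first word in cycle 8, the last word in
cycle 23, and the line available in cycle 24, matching the figures in the text.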




cache request for ith word of line
detect line missing
movein (MI) request -> MMU
modified line -> MO buffer
translate MI address
initiate RAM access
forward word i to PU
word i -> cache
word i+1 -> cache
...
word i-1 -> cache
MO request -> MMU
translate MO address


Figure 3.4. Cache Miss Timing with TB
When the last word of the missing line has been stored, moveout processing is
initiated. This is similar to movein processing: a moveout request is sent to the
MMU, the address of the line being moved out is translated, RAM access initiated,
and the contents of the MO buffer stored in memory, one word at a time. Note that
no cache bank conflicts can occur during transfer to memory, so the moveout
should complete in 23 cycles. The MMU is busy for the duration of a moveout; if
either an instruction cache or a data cache miss occurs during this time, the
requesting PU will be blocked until the moveout completes.


In the absence of MMU translation delays, a load or store
instruction which causes a cache miss incurs a delay of 8 cycles. This delay is


called the nominal miss: 
from a miss is called th 
sometimes is less than, the’) 


1 is greater than, but 
-'B misses increase 


issing data line is satisfied after a nomin 
ce will be blocked until the entire line has E 


to a missing line depends, among other thin gs, 
‘een the first and second instructions accessing 


[...] LDW results in a miss, and [...] while waiting for its
operand word to be forwarded. Because [...] is blocked, the ADD and
LDW which follow also are delayed: [...] interlocks in its E stage, and the
LDW interlocks in its D stage. When the operand word of the first LDW is
returned, it completes execution. With the [...] now free, the ADD completes
and the second LDW progresses to its [...], where it issues a cache request.
Because the line is still being moved in and is invalid, this instruction interlocks in
its E stage until the line becomes valid and its operand word can be read from the
cache, and so incurs a delay of 14 cycles. Additional instructions between
the first and second LDWs would reduce this delay.


If the page address of the missing line is not in the Translation Buffer, an 
additional delay of 8 cycles is incurred if the page table block address is contained 
in the Directory Buffer; if it is not, a 16-cycle delay is incurred (see Section 5).
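
As a rough illustration, the delay of the instruction that takes a demand miss can
be tabulated from the figures given above: 8 cycles nominally, plus 8 or 16 cycles
on a TB miss. The function and constant names are hypothetical, and the sketch
ignores bank conflicts, MMU-busy delays, and remote memory.

```python
NOMINAL_MISS_DELAY = 8        # cycles, TB hit, no other delays
TB_MISS_DIR_BUFFER_HIT = 8    # extra cycles: only the block entry is read
TB_MISS_DIR_BUFFER_MISS = 16  # extra cycles: directory and block entries read


def demand_miss_delay(tb_hit, directory_buffer_hit=True):
    """Delay in cycles seen by the instruction that caused the miss."""
    if tb_hit:
        return NOMINAL_MISS_DELAY
    if directory_buffer_hit:
        return NOMINAL_MISS_DELAY + TB_MISS_DIR_BUFFER_HIT
    return NOMINAL_MISS_DELAY + TB_MISS_DIR_BUFFER_MISS


for case in [(True, True), (False, True), (False, False)]:
    print(case, demand_miss_delay(*case))
```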
[Figure 3.5: pipeline timing diagram for an LDW, ADD, LDW instruction sequence.
Recoverable labels: cache request; line missing; pipeline blocked; write modified
line to buffer (all banks busy); move line into cache; word returned; line invalid;
interlocked.]


Figure 3.5. Cache Timing Example
non-privileged
CDC   @Sr        create data cache line in set addressed by Sr
FDC   @Sr        flush [...]
IDC   @Sr        invalid[...]
IIC   @Sr        invalid[...]
IICA
PDC   @Sr        [...] addressed by Sr
PIC   @Sr        [...] line addressed by Sr
PIC   *+Dsp      [...] line addressed by *+Dsp
UDC   Sr         [...] addressed by Sr (flush &
                 mark line unmodified)

[...] mark data cache line addr[...]
[...] Sr
ITLB
RDTX  Sr->Dst


[...] Cache Control Instructions


e past, caches have 
ures were defined before ¢é 2 
e cache miss rate and mii 


Ss in determining CPU 
il over the cache. The 


help improve performance. The Antares instruction set provides nine user-state
instructions for this purpose. There also are cache and Translation
Buffer (TB) control instructions for use by the kernel in flushing the cache and TB
on a task switch or a task termination.


The maximum instruction execution rate of the Antares CPU is one instruction
per cycle (per PU); the actual execution rate depends primarily on cache miss
delays, and to a lesser extent on cache bank conflict and pipeline interlock delays,
and on the relative frequency of multi-cycle instructions. To illustrate, suppose the
data cache miss ratio is 0.04 misses per access and the average number of data
accesses per instruction is 0.5: the data cache miss rate, then, is 0.02 misses per
instruction. If each miss causes an average delay of 11 cycles, then data cache
misses add 0.22 cycles to the mean instruction execution time. Instruction cache
misses add additional delay cycles (although instruction cache miss ratios usually
are lower than data cache miss ratios because of the greater locality of instruction
references).
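
The arithmetic of this example can be written out as a short sketch. The function
name is hypothetical; the three input values are the ones used in the text.

```python
def miss_cycles_per_instruction(miss_ratio, accesses_per_instruction,
                                average_miss_delay):
    """Cycles added to the mean instruction time by cache misses."""
    misses_per_instruction = miss_ratio * accesses_per_instruction
    return misses_per_instruction * average_miss_delay


# 0.04 misses/access * 0.5 accesses/instruction = 0.02 misses/instruction;
# at 11 cycles per miss this adds about 0.22 cycles per instruction.
added = miss_cycles_per_instruction(0.04, 0.5, 11)
print(round(added, 2))
```

Under these assumptions a PU that would otherwise execute one instruction per cycle
averages about 1.22 cycles per instruction from data cache misses alone.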
There are two ways to reduce the impact of cache misses on performance:
reduce the miss rate, and reduce the miss penalty. The compact Antares instruction
set helps reduce the instruction cache miss rate, and the 4-way set associative, full
LRU, cache design helps in miss rate reduction. Instructions are provided to
invalidate cache lines: when it is known that a line will not be used again, it can be
marked invalid and least recently used to save displacing another line which
may be referenced again. Modified lines which are no longer of use (e.g., lines
popped from the stack) can be [...]; this reduces memory traffic
delays and helps reduce the miss [...]. If all 16 words in a line are to be
written, a create data cache line instruction can be used to avoid bringing in the
original contents of that line.


Misses can be divided into two classes: demand misses and prefetch misses. A
demand miss occurs when an instruction fetch or an operand fetch or store
references a line not in the cache; demand miss timing was discussed earlier in this
section. Antares provides instructions to prefetch lines into the data and instruction
caches; in executing one of these instructions, the PU sends the [...]
the cache and continues fetching and executing instructions. [...]
generation for a procedure call. The actual [...] address is loaded into a [...]
prefetch miss occurs while the MMU is [...] a transfer, the prefetch request
simply is discarded.


An instruction is provided to flush (force the writing of) a modified data cache
line to memory. This instruction can be used to insure that memory shared between
CPUs is updated properly.


Much current-day software, such as that for the Motorola 68000, was not
developed with a cache in mind, and frequently produces higher than necessary 
miss rates when executed on a later CPU which has a cache. Antares software 
designers have the opportunity to reduce miss rates through careful organization of 
code and data and through the use of invalidate and create line instructions, and to 
reduce the effective miss penalty by prefetching. 
3.6 Cache Flushing


Cache lines and TB page entries are tagged with their virtual addresses, so that
lines and pages in one address space cannot be distinguished from lines or pages
with the same address in another address space. Consequently, in switching from
one address space to another (see Section [...]), the kernel must flush the cache and
[...] TB requires only simple invalidate operations, and instructions are
provided to perform these operations. Flushing the data cache [...]
lines have to be written to memory. To simplify control and reduce memory
[...] and uses [...] The kernel, [...] is easily parallelized, and [...]
hardware.
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4.1 Introduction 


edone at compile time. Parallel 
olling PU terminate by executing a 


halt instruction. The controlling PU can y 
initiating a serial activity by executing a wait instruction.


address broadcasting. During serial execution, one PU is executing and
the others are halted. To initiate execution of parallel activities, the controlling PU
activates one or more waiting (target) PUs via the broadcast instructions


RSM  PUmask       resume execution at target's current PC
or
STRT Ri, PUmask   start execution at address contained in register
                  Ri of the controlling PU
Broadcast instructions have a 4-bit field called the PUmask field: bits 0-3 of this
field correspond to PU numbers 0-3. Broadcast instructions operate on all PUs
whose PUmask bit is 1. The resume (RSM) instruction causes each target PU
(each PU specified by its PUmask field) to resume execution at that PU's current
Program Counter (PC) address. Halted PUs resume execution immediately. If all
target PUs are not halted at the time the RSM instruction is executed, the RSM
instruction blocks; execution on the PU issuing the RSM instruction continues only
when all target PUs have halted. The start (STRT) instruction is
similar to the resume instruction [...]
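
A minimal sketch of PUmask decoding and of the blocking rule for RSM follows. The
function names are hypothetical. The bit ordering is an assumption drawn from the
examples in this section, where the mask written 0010B selects PU 1: bit n of the
mask, counted from the right, stands for PU n.

```python
NUM_PUS = 4


def pumask_targets(pumask):
    """Return the PU numbers selected by a 4-bit PUmask."""
    return [pu for pu in range(NUM_PUS) if (pumask >> pu) & 1]


def rsm_can_proceed(pumask, halted_pus):
    """RSM blocks until every target PU has halted."""
    return all(pu in halted_pus for pu in pumask_targets(pumask))


print(pumask_targets(0b0010))              # mask written 0010B in the text
print(rsm_can_proceed(0b1110, {1, 2}))     # PU 3 still running: RSM blocks
print(rsm_can_proceed(0b1110, {1, 2, 3}))  # all targets halted: RSM proceeds
```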


SEND Ri->Rj, PUmask   store value in register Ri of the controlling PU in
                      register Rj of target PU


[...] to halt [...] via the instruction

WAIT PUmask


If the PUmask bit corresponding to the number of the PU executing the wait
instruction is set, the instruction unconditionally halts PU execution, and other
PUmask bits are ignored; the PU remains halted until reactivated by a resume or
start instruction or an interrupt. If the PUmask bit corresponding to the number of
the PU executing the wait instruction is not set, the PU waits until all the PUs
specified by PUmask have halted, and then continues execution. In this case, then,
the wait instruction performs a join operation.
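
The two behaviours of the wait instruction can be summarized in a small decision
function. This is an illustrative sketch, not the hardware logic: the function name
and the returned strings are hypothetical, and bit n of the mask is assumed to
stand for PU n.

```python
def wait_outcome(executing_pu, pumask, halted_pus):
    """Outcome of WAIT PUmask on the PU that executes it.

    Returns "halt" if the PU's own mask bit is set, otherwise "continue"
    once every PU named in the mask has halted, and "block" until then.
    """
    if (pumask >> executing_pu) & 1:
        return "halt"  # own bit set: other mask bits are ignored
    targets = [pu for pu in range(4) if (pumask >> pu) & 1]
    if all(pu in halted_pus for pu in targets):
        return "continue"  # join complete
    return "block"


# WAIT 1110B executed by every PU at the end of a parallel activity:
print(wait_outcome(1, 0b1110, set()))      # PU 1 halts
print(wait_outcome(0, 0b1110, {1, 2}))     # PU 0 waits for PU 3
print(wait_outcome(0, 0b1110, {1, 2, 3}))  # PU 0 continues
```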


If the PU's register contents are no longer useful, the PU should set the
"registers available" bit in its Status/Control Register (Figure 2.2) via a Set Mode
instruction prior to halting to indicate that its registers do not have to be saved. In 
recognizing an external interrupt, the Antares interrupt mechanism tries to assign 
processing of the interrupt to a halted PU. The operating system kernel examines 
the "registers available" bit and skips register saving and restoring if that bit is set. 
active


SEND R10->R0, 0010B
SEND R11->R1, 0010B
RSM  0010B


WAIT 0001B


The execution sequence [...] PU 1's program counter
holds the entry point address of the floating point add emulation code. The
operands of the add are contained in registers R10 and R11 of PU 0.


To initiate parallel execution of the add activity, PU 0 executes send
instructions to transmit the operands to PU 1, and then executes a resume
instruction to activate PU 1. PU 0 continues in execution until it reaches a point
where the result of the add is required and then halts by executing a wait instruction
with the PUmask bit for PU 0 set. When PU 1 finishes its computation, it issues a
send instruction to return its result to PU 0; if PU 0 is not halted at this time, the
send instruction blocks until PU 0 does halt. After sending its result to PU 0, PU 1
returns PU 0 to execution by executing a resume instruction. It then prepares for
the next add activity by jumping to its entry point and halting.

1 In instruction fields in this example, a numeric field terminated by a 'B' indicates a binary
number: e.g., '0010B' represents 0010 in base 2.
PU 0 (active)                PUs 1-3 (halted)

LDW  addr->R1
STRT R1,1110B                -> active
JMP  R1

...

WAIT 1110B                   WAIT 1110B (halt)


Figure 4.2. S 


The coordination required to return [...] could be implemented via
semaphores, with the advantage that the add emulation activity would not
need to know which PU invoked it.


Figure 4.2 shows the initiation and termination of an SIMD activity in which all
four processors execute the same code. To initiate this activity, PU 0 executes a
start instruction with bits set in the PUmask field for PUs 1, 2, and 3; this starts
these PUs executing at the specified address. PU 0 starts its own execution of the
activity via a jump instruction.


While all PUs execute the same code, their execution times for this activity may 
differ because of data differences and because they incur different delays, such as 
cache misses. All four PUs, on completing execution of this activity, execute the 
same wait instruction, WAIT 1110B. This causes PUs 1, 2, and 3 to halt, since their 
own PU number is specified in the PUmask field, and causes PU 0 to suspend 
execution until each of the other PUs has halted. Thus, synchronization at the end 
of this activity requires only a single instruction. 
[Figure 4.3 diagram. Recoverable labels: input; semaphore 0; semaphore 2; output.]


Figure 4.3. Semaphores [...] Pipelined Execution


4.3 Semaphores 


Broadcast instructions are used to transmit data from an active PU to halted
PUs, and to activate halted PUs. Semaphore operations are used to communicate
between and coordinate the activities of active PUs.


‘by concatenating the dis 
iddress. ise the discussio i 


set to empty. If F initially is set to ful 
F is set to empty. A load from semap 
from that location and sets F to empty, 
initially is empty, the PU executing 
Semaphore flags also are contained in § 


id is blocked until F is set to full. 
is tsa clusters changes both the 


examples. Semaphores are used to transmit data between executing PUs, to 
control access to data, and to control execution of "critical sections" of code. In the 
example of Figure 4.1, the result of the floating point add operation performed by 
PU 1 could have been returned to PU 0 via a semaphore, saving an instruction in 


2 While the contents of these locations may be read or written by register-addressing or
base-plus-displacement-addressing load/store instructions, only the direct-addressing
instructions perform semaphore operations.
LDW  s->R1        get lock
CMP  R1:0         end of list?
[...] R1->s       yes: unlock and exit
[...]             no: update, unlock, and continue


(b) access sequence


Figure 4.4. Data Locking Via a Semaphore


In this example, PU 0 reads an operand [...] value in semaphore 0 via [...].
If PU 1 has not yet read the previous value, the F flag of semaphore 0 will be set
to full, and the store instruction will block. When [...] instruction), the F flag
for semaphore [...]


Semaphores also are used as locks to control access to data. Suppose all four
PUs are executing in SIMD mode, operating on a queue of work maintained in the
form of a linked list (Figure 4.4(a)). Semaphore s is used both to lock the queue and
to hold the address of the next element in the queue. When a PU is ready to operate
on the next element, it executes the access sequence shown in Figure 4.4(b). If the
queue is unlocked, the F bit associated with semaphore s will be set to full. When a
PU executes the LDW instruction of the access sequence, the contents of semaphore
location s are returned to the PU and the F bit is cleared, blocking access to the
queue by other PUs and so locking it. The queue is unlocked when the PU
executes a store to semaphore location s, either after recognizing that the end of the
queue has been reached or after removing an element and advancing the queue head
to the next element.
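
The locking behaviour described above can be modelled in software with a
full/empty cell. This is only an analogy built on Python threads, not a
description of the hardware. The names SemaphoreCell and take_next are
hypothetical, and a value of 0 is assumed to mark the end of the list, as in the
CMP R1:0 test of the access sequence.

```python
import threading


class SemaphoreCell:
    """A word with a full/empty flag F, modelled with a condition variable."""

    def __init__(self, value=0, full=False):
        self._value = value
        self._full = full
        self._cond = threading.Condition()

    def load(self):
        """Block until F is full, return the value, and set F to empty."""
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

    def store(self, value):
        """Block until F is empty, write the value, and set F to full."""
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value = value
            self._full = True
            self._cond.notify_all()


def take_next(queue_head, next_of):
    """Access sequence: get lock, test for end of list, update, unlock."""
    element = queue_head.load()         # get lock (F becomes empty)
    if element == 0:                    # end of list?
        queue_head.store(0)             # yes: unlock and exit
        return None
    queue_head.store(next_of[element])  # no: advance head and unlock
    return element


# A three-element list 1 -> 2 -> 3 -> end, with the queue initially unlocked.
head = SemaphoreCell(1, full=True)
links = {1: 2, 2: 3, 3: 0}
print([take_next(head, links) for _ in range(4)])
```

While one caller holds the element address, any other caller's load waits, which
is the locking effect the text attributes to the cleared F bit.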
activity initiation via semaphores. Since semaphores can be used to
pass addresses, as well as data, it may seem that the broadcast instructions are
redundant. However, activity initiation-termination sequences based on
semaphores typically require execution of 6-10 instructions per PU, depending on the
scheme used. This overhead impacts performance both directly and indirectly (by
increasing code space). For example, suppose a sequence of 30 instructions can be
divided into two independent activities of 15 instructions each. If the cost of
parallelizing these activities is an additional 8 instructions, the performance gain³ is
only 30/23 = 1.3X; if the cost is [...] additional instructions, the
performance gain is 1.75X. [...] parallelization overhead
or, equivalently, [...] than is achievable
through semaphores alone.
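
The gain calculation in this example can be sketched as follows, assuming (as the
footnote does) one cycle per instruction and assuming the overhead is paid once on
each parallel path. The function and parameter names are hypothetical.

```python
def parallel_gain(total_instructions, activities, overhead_instructions):
    """Speedup from splitting a sequence into equal parallel activities."""
    parallel_time = total_instructions / activities + overhead_instructions
    return total_instructions / parallel_time


# 30 instructions split into two activities of 15, plus 8 overhead
# instructions: 30 / 23, or about 1.3X.
print(round(parallel_gain(30, 2, 8), 2))
```

Under the same assumptions, an overhead of 2 instructions gives 30/17, which is
close to the 1.75X figure quoted in the text; the exact overhead value the
authors used is not recoverable from this copy.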


4.4 Deadlock Detection


The state of each PU is represented by two bits in the Global [...]
(Special Register 13). Both bits are cleared if the PU is executing [...]
bits is set if the PU is waiting for a semaphore flag to change state, and [...]


[...] PUs which have not yet halted. If one or the other of [...]
deadlock situation is assumed to have occurred and [...]


3 Assuming an execution time of one cycle per instruction. 




5. Address Translation


SS space number as p 
g in the next section 
The operating system 


of the nts line tag; see the discussion of task 
ye address space model (Figure 5.1) is very 


for the kernel, and the 
parts of the operating 


[...] does not use Translation Buffer
(TB) entries; this helps improve the effectiveness of a relatively small TB¹.
Separate prefix addresses for user state and system state provide separate
direct-address space (and semaphores) for the user and the kernel. Lines in this
non-pageable kernel region are cached in the same way as pageable region lines. The
actual amount of real memory allocated to the kernel is determined by its needs; the
kernel, at system startup time, assigns the real memory it doesn't need in this
1-MW region to allocatable page space. The operating system may use a pageable
region, in addition to the hardware mapped kernel region; this region, however, is
not specified by the hardware architecture.


1 Antares resembles MIPS (the MIPS Computer Systems RISC CPU) in this regard: see
DeMoney et al [1986]. 
[Figure 5.1 diagram. Recoverable labels: virtual address space of 1024 MW,
0x00000000 to 0x3FFFFFFF; region beginning at 0x3FF00000; 1023 page mapped [...];
real memory of 1-64 MW, with 0x00000000 to 0x000FFFFF.]


page address with the line index to eB 
corresponding line into the cache. 


page table         page table      line index      word index
directory index    block index     in page (7b)    in line (4b)


Figure 5.2. Virtual Address Format
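
The field layout of Figure 5.2 can be illustrated by splitting a virtual word
address into its four indices. Only the 7-bit and 4-bit widths survive in the
figure; the 10-bit directory index and 9-bit block index used below are
assumptions inferred from the 1024-word directory, the 512-entry page table
block, and the 1024-MW (30-bit) address space. The function name is hypothetical.

```python
WORD_BITS = 4    # word index in line (16 words per line)
LINE_BITS = 7    # line index in page
BLOCK_BITS = 9   # assumed: 512 entries per page table block
DIR_BITS = 10    # assumed: 1024-word page table directory


def split_virtual_address(va):
    """Split a 30-bit virtual word address into (directory, block, line, word)."""
    if not 0 <= va < 1 << (DIR_BITS + BLOCK_BITS + LINE_BITS + WORD_BITS):
        raise ValueError("address outside the 1024-MW virtual space")
    word = va & ((1 << WORD_BITS) - 1)
    line = (va >> WORD_BITS) & ((1 << LINE_BITS) - 1)
    block = (va >> (WORD_BITS + LINE_BITS)) & ((1 << BLOCK_BITS) - 1)
    directory = va >> (WORD_BITS + LINE_BITS + BLOCK_BITS)
    return directory, block, line, word


# The start of the kernel region falls in the last directory entry.
print(split_virtual_address(0x3FF00000))
```

With these widths the four fields add up to exactly 30 bits, and address
0x3FF00000 lands at the start of directory entry 1023.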
In the case of direct addressing, a virtual address is formed by concatenating the
current prefix address with the instruction's displacement field. The current prefix
address is obtained from one of the Special Register 6 pair, S6[0] or S6[1], as
determined by the current cluster number setting. When a trap or interrupt occurs,
the current cluster number is set to [...] S6[0]. The prefix address in this
register typically is of the form 0x3[...], where 'zzz' is an offset from the start of
the kernel region to the start of the semaphore and directly-addressable
locations.


The page size in Antares [...]. This size is considered
small enough to provide [...] pages in a minimum real
memory configuration and large enough to obtain reasonable Translation Buffer
miss rates. Selection of a page size that would correspond to typical
object sizes was [...]

[...] translation of a virtual page [...] a two-level page table [...]


[...] Page Table Directory Origin (PTDO) [...]
This register is loaded via a Move Special [...]
[...] segment³ of the virtual address
space: if no pages in a segment are allocated, a flag in the directory entry
is set to invalid. The segment at the high end of the address space corresponds to
the kernel region. If any page in a segment is allocated, the directory entry contains
the starting address of a page table block of 512 entries, one entry for each page in
the segment. A page table block entry comprises a set of flags, including an invalid
bit, and a real address field. If a real page is bound to the virtual page represented


2The VAX, with a page size of 512 bytes, experiences very high TLB miss rates. Clark and Emer 
[1985] report TLB miss rates in the vicinity of 0.033 misses/instruction for a VAX 11/780 with a 
128-entry TLB (these rates however, represent operation in a multi-user environment; also, the 
split design of the VAX TLB results in a relatively high miss rate for its size). 


3A segment is defined as the region in a virtual address space represented by a directory entry; it 
has no architectural definition beyond that. 
[Figure 5.4 diagram. Recoverable labels: Page Table Directory Origin (PTDO)
Register; Page Table Directory (1024 words); directory entry address; directory
entry (PTBO = Page Table Block Origin); page table block entry address; page table
block entry; real page address; real line address.]


Figure 5.4. Virtual to Real Address Translation
by an entry, the invalid bit is not set and the entry provides the real page address
together with access permission flags; the latter are forwarded to the cache for
inclusion in the cache line tag. If either the invalid bit in the directory entry or in the
page table block entry is set, a page fault interrupt is initiated by the MMU.


The real page address in a page table block entry is the concatenation of a CPU
(node) number and a page address in the memory connected to that CPU. The
MMU uses the CPU number to [...]


Figure 5.4. First, the d 


done by hardware. This i i B the directory index 
part of the address and treating i fess as a real memory 


provides the page table block origin a : 
index (BI) field of the virtual addre 
k entry. The page table block entry provid 


5.4 The MMU 


The translation of a virtual address to a real address is performed by the MMU.
The steps in translation of a user region virtual address, and the MMU elements
involved, are illustrated in Figure 5.5.


When a cache miss occurs, the virtual address of the missing line is sent to the
MMU. The MMU extracts the virtual page address and searches the Translation
Buffer (TB) for it. The TB is a small, fully associative⁴ cache which holds
translations for the n most recent MMU page references. In the initial
implementation of Antares, n is expected to be 16. A TB entry contains a virtual page


4 I.e., any virtual page address can map into any entry: in a set associative TB, a page entry can
map only into a given set. 
[Figure 5.5 flowchart, partially recoverable:
cache miss -> is VPA in TB?
  yes: get RPA from TB; update TB; form real line address; read line
  no:  [...] form BEA; read block entry from memory: valid?
         no:  page fault
         yes: get RPA from entry; replace TB entry; form real line address;
              read line
MMU elements: Translation Buffer (8-16 entries, fully associative);
Directory/Block Address & Entry Registers; Directory Entry Buffer.]


notation
VPA - virtual page address
RPA - real page address
DEA - directory entry address
BOA - block origin address
BEA - block entry address


Figure 5.5. Address Translation: MMU Operations and Elements
address, the corresponding real page address, and a set of flags which include flag
bits from the page table block entry. If the virtual page address is found in the TB
(a TB hit), the real page address is obtained from the entry, the real address of the
line is formed, and a line read request is sent to (local or remote) memory. The TB
entry is established as the MRU (most recently used) entry, either by reordering TB
entries (there is ample time to do this while waiting for the data transfer to begin) or
by setting appropriate flag bits. Although the TB is small, it should provide good
performance because of its fully associative organization and the relatively large
[...]

If the virtual page address is not found in the TB, the MMU forms the address
of the directory entry [...]


[...] lines for one code segment and one data segment
[...]ident, and there may be several data segments in
[...] substantially increas[...]
[...] would have increased the [...] MMU. For these reasons,
the MMU is designed to read directory and block entries from memory as
needed, using its own entry address and entry registers. The small directory
buffer significantly reduces the number of memory reads required for directory
entries, so that the majority of cache misses which also miss in the TB incur an
added penalty of just the 8 cycles required to read the page table block entry.


When a directory entry is read from memory, it is checked to see if it is valid; if
it is not, a page fault interrupt is generated. If the entry is valid, it replaces the least
recently used entry in the directory buffer.


The block origin address obtained from the directory buffer or read from 
memory is concatenated with the block index to form the block entry address, and 
the entry is read and checked. If invalid, a page fault interrupt is generated.
Otherwise, the real page address from the entry is used to form the real line address
and a read request is initiated for the line. A new TB entry is constructed and
inserted in the TB in place of the least recently used entry.


Page table directories and blocks reside in the kernel region but are accessed by 
the MMU using real memory addresses, not kernel region addresses. 
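
The translation steps of this section can be sketched as a small software model.
It is an illustration under stated assumptions, not the MMU design: the class and
method names are hypothetical, a missing dictionary key stands in for an entry
whose invalid bit is set, the directory buffer size of 2 is a placeholder (the
text says only that it is small), and the 10-bit/9-bit split of the virtual page
address is inferred from the 1024-word directory and 512-entry blocks. The extra
delays of 8 and 16 cycles are the ones quoted in Section 3.4.

```python
from collections import OrderedDict


class PageFault(Exception):
    """Raised when a directory entry or block entry is invalid."""


class MMUModel:
    def __init__(self, page_table_directory, tb_entries=16,
                 dir_buffer_entries=2):
        # {directory index: {block index: real page address}}
        self.directory = page_table_directory
        self.tb = OrderedDict()          # virtual page -> real page, LRU order
        self.dir_buffer = OrderedDict()  # directory index -> page table block
        self.tb_entries = tb_entries
        self.dir_buffer_entries = dir_buffer_entries

    @staticmethod
    def _insert_lru(table, key, value, capacity):
        if len(table) >= capacity:
            table.popitem(last=False)    # replace the least recently used entry
        table[key] = value

    def translate(self, vpa):
        """Return (real page address, extra delay in cycles)."""
        if vpa in self.tb:               # TB hit: make the entry MRU
            self.tb.move_to_end(vpa)
            return self.tb[vpa], 0
        directory_index, block_index = vpa >> 9, vpa & 0x1FF
        if directory_index in self.dir_buffer:
            self.dir_buffer.move_to_end(directory_index)
            extra_cycles = 8             # only the block entry is read
        else:
            if directory_index not in self.directory:
                raise PageFault("invalid directory entry")
            self._insert_lru(self.dir_buffer, directory_index,
                             self.directory[directory_index],
                             self.dir_buffer_entries)
            extra_cycles = 16            # directory and block entries are read
        block = self.dir_buffer[directory_index]
        if block_index not in block:
            raise PageFault("invalid page table block entry")
        rpa = block[block_index]
        self._insert_lru(self.tb, vpa, rpa, self.tb_entries)
        return rpa, extra_cycles


mmu = MMUModel({0: {5: 77, 6: 78}})
print(mmu.translate(5))  # TB miss and directory buffer miss
print(mmu.translate(5))  # TB hit
print(mmu.translate(6))  # TB miss, directory buffer hit
```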
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5.5 Page Table Entry Format 


A Page Table Directory entry contains a valid bit and, for a valid entry, the real 
address of the first word of the Page Table Block associated with that entry. A 
Page Table Block entry comprises the fi i 


length 
in bits 
18 


ed to system components in additi 
as the NuBus 


valid/invalid flag 


defined later. 


5.6 Non-Cacheable Pages 


A page can be designated as non-cacheable (by having the kernel
set the appropriate tag bit in the page table entry for that page). When a load or
store access is made to a word in a non-cacheable page, that word is transferred
directly between the designated register and local or remote memory.
Non-cacheable pages have two principal uses: memory-mapped IO, and inter-CPU
messages, both of which involve bypassing the data cache. However, it is possible
to bypass the instruction cache by declaring a code page non-cacheable, which is
useful in debugging and testing.


Processing of a load or store access to a non-cacheable page is much like miss 
processing. The cache, on receipt of the load or store request, searches its tag store 
for the associated address and, when the address is not found, sends a movein
request to the MMU, selects a line for replacement, and prepares to invalidate the
selected line. The MMU performs address translation in the usual way and obtains
the page table entry for the page containing the referenced word, either from the
Translation Buffer (TB) or, in the event of a TB miss, from the page table block.
The MMU examines the page table entry flags and, on recognizing that the page
is non-cacheable, signals the cache [...] invalidation of the line selected for
replacement, and initiates a one-word [...]
of a load or store to a non-cacheable [...], if the page address is
contained in the TB, and [...] (depending on whether or [...]


Srrupt-on-write tag bit 
‘© perform an address 
yan address range that is 


comparison to determine if an IPB 
an interrupt. 


defined as a message range and so req 


As an example, a possible mapping [...]
an address space of CPU i's into which it writes to send a message to CPU
j is called the outbox page for CPU j. The outbox page for CPU j maps into a real
memory address in CPU j's memory called the inbox page; i's page table entry for
this page has the system, non-cacheable, and interrupt-on-write bits set. CPU i
sends a message to CPU j by writing a word containing the message operation code
to word address 8*i in the outbox page for CPU j, where 8 is a constant
determined by the operating system. The MMU performs the address translation
and initiates an IPB transfer with interrupt; the interrupt is presented to the
receiving CPU after the message word has been stored in j's memory. j becomes
disabled on recognition of the interrupt; j's kernel retrieves the message address


5 If the page is in local memory. Additional cycles will be required if the page is in remote
memory and the access is effected via an IPB transfer; exact timing is yet to be determined. 
[Figure residue. Recoverable labels: outbox page, CPU 1; outbox page, C[...];
system space.]


from the Interrupt Argument Register, retrieves the message from the inbox, and
sends an acknowledgement to CPU i (in the same way any other message
is sent). In this scheme, every CPU in an n-CPU system has n-1 outbox pages
and 1 inbox page; every outbox for CPU j maps to the same real page in j's
memory.


If CPU i sends a message to CPU j and CPU j is disabled for interrupts, CPU
j's MMU will perform the write operation and queue the interrupt until the CPU
enables interrupts and the message interrupt can be recognized. Only one message
interrupt can be queued; if some other CPU, say CPU k, attempts to send a
message to CPU j while CPU j has a queued message interrupt, CPU k's message
will be rejected. This rejection is effected via a synchronous negative response to
CPU k's IPB transfer; it blocks completion of the STW instruction which initiated
the message, and it causes a "message rejected" trap to be generated on the PU
which issued the STW instruction.
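
The queue-of-one behaviour described above can be modelled as follows. This is a
software sketch, not the IPB protocol: the class, method, and exception names are
hypothetical, and the inbox is represented as a dictionary keyed by word address.

```python
class MessageRejected(Exception):
    """Stands in for the "message rejected" trap on the sending PU."""


class ReceiverModel:
    """Receiving CPU j: its inbox page and its single queued interrupt."""

    def __init__(self):
        self.inbox = {}                # word address -> message word
        self.interrupts_enabled = True
        self.queued_sender = None      # at most one queued message interrupt

    def receive(self, sender, message_word, stride=8):
        """Handle a store by CPU `sender` into its outbox page for this CPU."""
        if self.queued_sender is not None:
            raise MessageRejected(f"message from CPU {sender} rejected")
        self.inbox[stride * sender] = message_word  # write is always performed
        if self.interrupts_enabled:
            self.interrupts_enabled = False  # disabled on recognition
            return "interrupt"
        self.queued_sender = sender          # hold until interrupts are enabled
        return "queued"

    def enable_interrupts(self):
        """Enable interrupts; recognize a queued message interrupt, if any."""
        self.interrupts_enabled = True
        if self.queued_sender is None:
            return None
        sender, self.queued_sender = self.queued_sender, None
        self.interrupts_enabled = False
        return sender


cpu_j = ReceiverModel()
print(cpu_j.receive(0, "op-a"))  # recognized immediately
print(cpu_j.receive(2, "op-b"))  # CPU j now disabled: interrupt is queued
try:
    cpu_j.receive(3, "op-c")     # a second pending message is rejected
except MessageRejected as trap:
    print(trap)
```

How a sender retries after a rejection is left to the kernel, as the following
paragraph describes.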


The kernel decides how to deal with rejected messages. In a small
configuration, it simply may reinitiate execution of the STW instruction. In a large
configuration, it may use some adaptive [...] to determine an appropriate delay
before another attempt to send a [...] be made, and perhaps reschedule
the executing task if the delay is [...]
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iOn of a particular instruc - 
hat instruction. Traps can 
struction) or as exceptions, 


operation codes, and rejected inter-CPU 


Interrupts usually are caused by events external to the CPU and may be
processed by any PU, except for inter-PU interrupts generated by execution of an
INT or RES (restart) instruction, which are processed by the PU(s) specified in the
PUmask field of the INT or RES instruction. All other interrupts can be processed
by any PU; the (hardware) interrupt handler assigns processing of an interrupt to a
halted PU whenever possible so that interrupt processing can be done concurrently
with user/system task execution. An inter-CPU, or message, interrupt occurs when
one CPU executes a store instruction which causes a word or a line to be written to
a page marked "interrupt-on-write" mapped into the address space of another CPU
(see Section 5.7). Antares provides a pair of Event Counters (global special
registers S10 and S11) which, under control of the Event Selection Register (S12),
INTERRUPTS                   TRAPS

Reset                        Arith. Overflow/Divide By 0*
Machine Check                Illegal Operation/Taken Branch*
Restart                      Data Access Violation
Power/Temp.                  Instruction Access Violation
Inter-PU (INT instr.)
Inter-CPU (message)
Event Counter Overflow*
External

* individually enabled [...]

[...] C Save Queue, and [...] traps and interrupts
except machine reset, this is word a[...]
or interrupt number (0-31). (Machine reset [...] transfer to location 0.) The
kernel's (software) interrupt handler [...] whether or not it needs to save
[...] the PU's "registers available"
bit in the Status Save Register. Any necessary coordination with other instances of
kernel execution can be done via semaphores.


Interrupt/trap processing is controlled by a single (CPU-wide) master enable
flag (in the Interrupt Argument Register) and individual PU enable flags (in each
PU's Status/Control Register). The possible states of these two flags and the 
corresponding interpretations are shown in Figure 6.2. When an interrupt can be 
recognized (CPU enable flag set and a PU enable flag set for at least one PU), a PU 
is assigned to process the interrupt, the interrupt enable flag is cleared, and the trap 
enable flag of the selected PU is cleared. When a trap (or an interrupt generated by 
an INT instruction) is recognized, only the PU's trap enable flag is cleared. Once 
cleared, interrupt and trap enable flags remain cleared until explicitly set. However, 
the Reset and Machine Check interrupts override the state of the interrupt and trap 
enable flags. If an interrupt is presented while the CPU is disabled for interrupts, it 
CPU enable | PU enable


1 PU can recognize an interrupt or a trap 
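
The interaction of the CPU-wide and per-PU enable flags can be sketched as a small model. This is an illustration only, not the hardware algorithm: the class and method names are hypothetical, the choice of the lowest-numbered candidate is an assumption, and so is the requirement that a halted PU also have its enable flag set. The model likewise assumes that overriding interrupts (Reset, Machine Check) clear the flags like any other interrupt, which the text does not state.

```python
from dataclasses import dataclass, field


@dataclass
class PU:
    enabled: bool = True   # per-PU enable flag (Status/Control Register)
    halted: bool = False


@dataclass
class CPU:
    master_enable: bool = True  # CPU-wide flag (Interrupt Argument Register)
    pus: list = field(default_factory=lambda: [PU() for _ in range(4)])

    def take_interrupt(self, override=False):
        """Return the index of the PU assigned, or None if not recognized.

        override=True models Reset and Machine Check, which ignore the
        state of the enable flags.
        """
        if override:
            candidates = list(range(len(self.pus)))
        else:
            if not self.master_enable:
                return None
            candidates = [i for i, p in enumerate(self.pus) if p.enabled]
            if not candidates:
                return None
        # Prefer a halted PU so running tasks are not disturbed.
        halted = [i for i in candidates if self.pus[i].halted]
        chosen = (halted or candidates)[0]
        self.master_enable = False        # interrupt enable flag cleared
        self.pus[chosen].enabled = False  # selected PU's trap enable cleared
        return chosen

    def take_trap(self, pu_index):
        """A trap (or INT-generated interrupt) clears only that PU's flag."""
        pu = self.pus[pu_index]
        if not pu.enabled:
            return False
        pu.enabled = False
        return True


cpu = CPU()
cpu.pus[2].halted = True
assigned = cpu.take_interrupt()   # PU 2 is halted, so it is selected
```

After this call both `cpu.master_enable` and `cpu.pus[2].enabled` are cleared, so a second ordinary interrupt is not recognized until software sets the flags again.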


virtual address; this does not suffice to [...]
a line with the same address in another [...]
for the TB.) There are two approaches [...]


First, information can be added to the cache tag to uniquely identify the line:
this information could take the form of an address space number (ASN) or the 
distinguishing part of the real line address (which substantially increases tag storage 
space). While adding an ASN to the tag is less demanding in terms of tag storage, 
the MMU becomes much more complicated. If address space B is active and a line 
from B replaces a modified line of address space A, the MMU has to retrieve the 
page tables of address space A in order to translate the virtual address of the 
modified line prior to its moveout. 


Second, the cache can be emptied on an address space switch so that lines from 
different address spaces cannot be in the cache at the same time. This has two 
performance costs: the direct cost of the cycles required to carry out the flush and 
invalidate operation (including the time required to write modified lines to memory),
and the indirect cost of discarding lines which, when the original address space is
returned to execution, will have to be brought back into the cache.
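
The aliasing problem and the effect of flushing can be shown with a toy model of a cache tagged by virtual line address only. Everything in the sketch is an assumption made for illustration (the class name, the dictionary standing in for memory, the omission of line replacement and write-back); it is not a description of the Antares hardware.

```python
class VirtualCache:
    """Toy cache tagged by virtual line address only (no ASN in the tag)."""

    def __init__(self, flush_on_switch):
        self.flush_on_switch = flush_on_switch
        self.lines = {}     # virtual line address -> cached data
        self.space = None   # currently active address space
        self.misses = 0

    def switch(self, space):
        if self.flush_on_switch and space != self.space:
            self.lines.clear()  # flush and invalidate (write-back omitted)
        self.space = space

    def read(self, vaddr, memory):
        if vaddr not in self.lines:
            self.misses += 1
            self.lines[vaddr] = memory[(self.space, vaddr)]
        return self.lines[vaddr]


# Two address spaces map the same virtual line to different data.
memory = {("A", 0x40): "a-data", ("B", 0x40): "b-data"}

unsafe = VirtualCache(flush_on_switch=False)
unsafe.switch("A"); unsafe.read(0x40, memory)
unsafe.switch("B")
stale = unsafe.read(0x40, memory)   # hits the line left behind by A

safe = VirtualCache(flush_on_switch=True)
safe.switch("A"); safe.read(0x40, memory)
safe.switch("B")
fresh = safe.read(0x40, memory)     # miss, refilled from B's mapping
safe.switch("A"); safe.read(0x40, memory)
```

Without flushing, the read in space B returns A's data. With flushing, each switch forces a refill, so the A, B, A sequence costs three misses; the third is the indirect cost described above.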


Antares uses the second approach: the cache is flushed on an address space
switch. This approach is chosen for simplicity and to minimize tag space.
Because of the small cache size of the [...] design, the performance penalty of
flushing is not expected to be s[...] from address space A to B and
back without flushing (A > [...]
address space A remaining in [...]
be small, so that flushing de[...]
a task switch. Cache f[...]
Translation Buffer (ITL[...] [...]ffer as well as the
Translation Buffer.


Later implementations of the Antares architecture, with larger caches, probably
will adopt a different approach to this problem. The architectural visibility is very
low, so different implementations should not present a compatibility problem.